Received: from IS-4.STERN.NYU.EDU by optima.cs.arizona.edu (5.65c/15) via SMTP id AA12630; Fri, 9 Apr 1993 15:29:38 MST
Received: by is-4.stern.nyu.edu (4.1/1.34) id AA12660; Fri, 9 Apr 93 18:37:09 EDT
Date: Fri, 9 Apr 93 18:37:06 EDT
From: Jim Clifford <jcliffor@is-4.stern.nyu.edu>
To: tsql@cs.arizona.edu
Subject: Further Discussion of the Benchmark Data
Message-Id: <CMM.0.90.2.734395026.jcliffor@is-4.stern.nyu.edu>

Colleagues:

Apparently my comments on the notion of a "key" in an historical database, and my concern about the bias of the proposed Benchmark Data in Section 3, were not clearly stated. I will repeat my earlier remarks here, and number the points, because I believe that we can get agreement about some of them, and perhaps have some discussion about others.

*********************************************************************
(1) I believe that the following criterion should be used in populating the agreed-upon schema with data: the database instance should accord with ALL AND ONLY those constraints which are explicitly stated.

*********************************************************************
(2) The proposed database instance violates the AND ONLY part of this criterion in at least the following way (and possibly others): there is an implicit assumption about the meaning of being a "key" that I believe (i) is stronger than necessary, and (ii) should at the very least be made explicit.

*********************************************************************
(3) The only explicit assumption stated about, e.g., the "key" attribute Name in relation EMP is that it obeys the "snapshot function dependency"

    Name --> Salary, Dept, Gender, D-birth

This means that for all snapshots, no 2 tuples can have the same Name. I assume, therefore, that this is the intended meaning of the use of the term "key".

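As a rough illustration of the reading of "key" in point (3), the sketch below checks that no snapshot contains two tuples with the same Name. The tuple-stamped encoding, the helper names, the use of whole years as time points, and the sample rows are all assumptions made for this sketch; they are not taken from the Draft Proposal or from any particular model.

```python
from collections import Counter

# Hypothetical tuple-stamped EMP rows: (start, end, name, salary),
# valid over the half-open interval [start, end). Values are illustrative.
EMP = [
    (1982, 1988, "Ed", 30000),
    (1988, 1995, "Ed", 40000),
    (1984, 1995, "Di", 35000),
]


def snapshot(relation, t):
    """Return the tuples valid at time t, with the timestamps stripped."""
    return [row[2:] for row in relation if row[0] <= t < row[1]]


def satisfies_snapshot_key(relation, times):
    """True if, at every time point checked, no two tuples share a Name."""
    for t in times:
        names = Counter(row[0] for row in snapshot(relation, t))
        if any(count > 1 for count in names.values()):
            return False
    return True
```

The check looks at one snapshot at a time and never compares Names across different time points.
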
*********************************************************************
(4) However, the proposed database also assumes that the "key" attribute Name is time-invariant. This is a stronger condition than the notion of "key" as a "snapshot function dependency," and biases the proposed benchmark in favor of the tuple-stamped models.

*********************************************************************

DISCUSSION ABOUT THESE 4 POINTS:

(1) So far, no one has commented on this. I assume it is non-controversial.

(3) Ditto -- I gather that everyone views this as the meaning of the term "key" in the context of an historical relation, i.e., that at each time point t, there can be AT MOST 1 TUPLE in the relation with a given value for the key.

My points numbered (2) and (4) above were both attempts to make the same point, in a general kind of way in (2) and more specifically in (4). These two points need elaboration.

In the Draft Proposal "Section 3.2 The Proposed Data," beginning with the sentence "There are two employees, Ed and Di.", there is an implied connection between these 2 "people" in the real world and the strings and integers with which we populate the database. Specifically, it is implied that a real-world person (let's call him *ED* to distinguish him from the character string "Ed" we store as his Name in the database) is identified with all the tuples in the database storing the character string "Ed" as the value of the NAME attribute.

The reason this biases the database instance in favor of the ungrouped models is that it obscures a problem with the ungrouped models (as pointed out in [CCT93]), namely that these models do NOT provide as much built-in management of temporal data as the grouped models do. Rather, they either (i) put unnecessary restrictions on the types of data they will manage, or (ii) put an unnecessary burden on the end user.

I want to elaborate on these two points, and in so doing I think that I will be answering Christian's two specific questions and/or comments, which I will here label (a) and (b), viz.:

(a) It would be of great help if someone (Jim?) could exemplify what is missing in the current instance. What can we add to the instance in order not to violate the AND ONLY criterion?

AND

(b) The standard relational model and conventional normalization and dependency theory cannot capture the connection between Emp tuples and the real-world persons they represent. There is no real notion of object identity (existence). When the key of a tuple changes its value, there is no way of telling that the tuple still represents the same real-world person. For example, if the tuple for Di, (Di, 30K), is changed to (Jo, 30K) because Di changes name, we cannot capture that both tuples belong to the same person and that Name thus is time-varying.

**********************
*  The Saga of *ED*  *
**********************

Let us suppose that on or about 1/1/88 (remember, we are not dealing with Transaction Time yet) our employee *ED*, who previously went by the name "Ed", informs his company that as of 1/1/88 he wants his name to be "Edward". The company is using a Temporal Database. One of 2 possibilities arises at this point.

IF the database has a notion of KEY as we seem to agree that it should have, then the database can support this change in Name. If, on the other hand, the database disallows any changes to the value of a key AT ANY TIME (which is a constraint that the Proposed Data satisfies, but the Proposed Schema does not require), then *ED* will be told that the Temporal Database system will not allow this. In this case, we are done. We are either happy or unhappy with a database which cannot accurately model the real world, in this case by being unable to reflect *ED*'s name change.

I for one am not happy with a database model that makes this restriction but, if it is made by a model, it should be made explicitly.

So, let us suppose that the Temporal Database does NOT have this stronger constraint on a key, i.e., that someone performs whatever update(s) are required in the Temporal Database to reflect *ED*'s name change.

Let us consider 2 types of operations on the Temporal Database that pertain to this change in *ED*'s name, and let us look at them in the context of an Ungrouped Temporal Model and a Grouped Temporal Model. First, the UPDATE to the database to reflect the name change, and second, any subsequent QUERY about *ED* which an end user might ask of the database (*ED* himself not being available to answer the question in person).

THE UPDATE in UNGROUPED MODELS.

My reading of the literature describing these models would indicate that someone in the organization authorized to make UPDATES would reflect this change by the addition of a tuple to the relation, something like (1/1/88, "Edward", 50K, Male, ...etc.). There might also be some modification to 1 or more tuples already in the relation (to "close" their period of validity, e.g.).

THE UPDATE in GROUPED MODELS.

In these models, someone in the organization authorized to make UPDATES would attempt to locate the tuple recording information about *ED* by searching for the tuple that satisfies the selection restriction that Name(NOW) = "Ed". Note that by the definition of "key", there can be at most 1 of these. (ASIDE: If no such tuple is found, then perhaps other queries might be attempted, but we are presuming that there is a tuple somewhere in the relation that has *ED*'s data, and it will be found.) Once found, the tuple is updated with the new <1/1/88, "Edward"> information in whatever way the model requires.

THE QUERY in UNGROUPED MODELS.

For a query about *ED* to return COMPLETE information about him, in Temporally Ungrouped Models IT IS UP TO THE USER to ask for the union of those tuples in the relation with NAME="Ed" with those tuples in the relation with NAME="Edward". In fact, it is actually worse than this. Our key constraint only says that at any particular time t there can be at most one "Ed" or one "Edward" -- however, there could be other "Ed"s or "Edward"s at other times not in our friend *ED*'s lifespan. So, the user must know ALL of *ED*'s names and WHEN he had them in order to be sure to get all of the information about *ED* in the database. So, a big burden is placed on the end user: the end user must already know a great deal about *ED* in order to find out information about *ED*.

THE QUERY in GROUPED MODELS.

For a query about *ED* to return COMPLETE information about him, in Temporally Grouped Models IT IS ONLY NECESSARY for the end user to know the name of *ED* at some point in time (perhaps, e.g., NOW). The MODEL is responsible for the rest, because the model manages the temporal dimension of the data about *ED*, and places (I believe) only minimal and reasonable demands on the end user.

Clearly there is a burden to be placed on someone in the management of (temporal) data. There are roughly 3 places to place this burden and of course the burden is to be shared among them in some way: (a) on the DBMS, (b) on the UPDATER, and (c) on the QUERYer. I believe that it makes more sense to place a heavier burden on the UPDATER than on the QUERYer. The UPDATER presumably knows --- or should know --- quite a lot about what he/she intends to update, and it seems reasonable to require this. So, the UPDATER should have to know that *ED*'s Name in the database is currently "Ed" -- hardly an unreasonable burden. However, the QUERYer can reasonably be presumed to know little (or at least less) about the database, and is in fact posing a QUERY to learn more.

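The two QUERY cases can be sketched in a few lines of Python. This is only a toy rendering under stated assumptions: the flat and nested encodings, the function names, the use of whole years as time points, the salary figures, and the extra "Edward" from an earlier period are all invented for illustration and do not come from the Proposed Data or from any specific grouped or ungrouped model.

```python
# Ungrouped (hypothetical): flat rows (start, end, name, salary), valid on [start, end).
UNGROUPED = [
    (1982, 1988, "Ed", 30000),
    (1988, 1996, "Edward", 40000),
    (1970, 1975, "Edward", 20000),  # assumed: a different person, outside *ED*'s lifespan
]

# Grouped (hypothetical): one object per real-world person, one history per attribute.
GROUPED = [
    {"name": [(1982, 1988, "Ed"), (1988, 1996, "Edward")],
     "salary": [(1982, 1988, 30000), (1988, 1996, 40000)]},
    {"name": [(1970, 1975, "Edward")],
     "salary": [(1970, 1975, 20000)]},
]


def value_at(history, t):
    """Return the value a history holds at time t, or None if undefined."""
    for start, end, value in history:
        if start <= t < end:
            return value
    return None


def ungrouped_by_name(relation, name):
    """Naive query: every row carrying this name, whoever it belongs to."""
    return [row for row in relation if row[2] == name]


def ungrouped_history(relation, names_with_periods):
    """The user supplies every (name, start, end) the person ever had."""
    return [row for row in relation
            if any(row[2] == n and s <= row[0] and row[1] <= e
                   for n, s, e in names_with_periods)]


def grouped_history(relation, name, t):
    """The user supplies one name at one time; the key allows at most one match."""
    matches = [obj for obj in relation if value_at(obj["name"], t) == name]
    return matches[0] if matches else None
```

In this toy data, `ungrouped_by_name(UNGROUPED, "Edward")` misses the "Ed" years and picks up the unrelated earlier "Edward", whereas `grouped_history(GROUPED, "Edward", 1993)` returns the whole object from a single name at a single time.
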
So, it seems reasonable to design a model that REQUIRES AS LITTLE AS POSSIBLE of the QUERYer. Expecting the QUERYer to know at least something about *ED* --- like his Name at some point in time --- in order to learn more about him seems reasonable; expecting the QUERYer to know *ED*'s Name at every point in time does not.

So, my recommendation is to modify the Proposed Data to include the following:

Assume that the following "event" occurs in the real world that we are modeling:

    *ED* changes his name on 1/1/88 to "Edward"

We can assume that all of his other characteristics are as they are stated in the Proposed Data section.

(This is also my answer to Christian's question (a): It would be of great help if someone (Jim?) could exemplify what is missing in the current instance. What can we add to the instance in order not to violate the AND ONLY criterion?)

I believe that without such an addition, the Proposed Benchmark would not be "independent" of any existing model/language proposal, which is in fact one of its major goals.

REFERENCES

[CCT93] J. Clifford, A. Croker, and A. Tuzhilin, "On Completeness of Query Languages for Grouped and Ungrouped Historical Data Models", in Temporal Databases: Theory, Design, and Implementation, 1993.

I welcome your comments.

--jim--

************************************************************************
Jim Clifford                          jclifford@stern.nyu.edu
Associate Professor                   TEL: (212) 998-0803
Department of Information Systems     FAX: (212) 995-4228
Leonard N. Stern School of Business
New York University
Management Education Center
44 West 4th Street, Suite 9-74
New York, NY 10012-1126
************************************************************************